{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Search and Summarize: AI-Powered Web Research Tool\n",
    "\n",
    "## Overview\n",
    "This Jupyter notebook implements an intelligent web research assistant that combines web search capabilities with AI-powered summarization. It automates the process of gathering information from the internet and distilling it into concise, relevant summaries, enhancing the efficiency of online research tasks.\n",
    "\n",
    "## Motivation\n",
    "In the age of information overload, efficiently extracting relevant knowledge from the vast expanse of the internet is increasingly challenging. This tool addresses several key pain points:\n",
    "\n",
    "1. Time consumption in manual web searches\n",
    "2. Information overload from multiple sources\n",
    "3. Difficulty in quickly grasping key points from lengthy articles\n",
    "4. Need for focused research on specific websites\n",
    "\n",
    "By automating the search and summarization process, this tool aims to significantly reduce the time and cognitive load associated with web research, allowing users to quickly gain insights on any topic.\n",
    "\n",
    "## Key Components\n",
    "The notebook consists of several integral components:\n",
    "\n",
    "1. **Web Search Module**: Utilizes DuckDuckGo's search API to fetch relevant web pages based on user queries.\n",
    "2. **Result Parser**: Processes raw search results into a structured format for further analysis.\n",
    "3. **Text Summarization Engine**: Leverages OpenAI's language models to generate concise summaries of web content.\n",
    "4. **Integration Layer**: Combines the search and summarization functionalities into a seamless workflow.\n",
    "\n",
    "## Method Details\n",
    "\n",
    "### Web Search Process\n",
    "1. The user provides a search query and optionally specifies a target website.\n",
    "2. If a specific site is given, the tool performs two searches:\n",
    "   a. A site-specific search within the specified domain\n",
    "   b. A general search excluding the specified site\n",
    "3. Without a specific site, it conducts a general web search.\n",
    "4. Search results are parsed to extract snippets, titles, and links.\n",
    "\n",
    "### Summarization Approach\n",
    "1. For each search result, the tool extracts the relevant text content.\n",
    "2. The extracted text is sent to the AI model with a prompt requesting a concise summary.\n",
    "3. The AI generates a summary in the form of 1-2 bullet points, capturing the key information.\n",
    "4. Summaries are compiled along with their sources (title and link).\n",
    "\n",
    "### Integration and Output\n",
    "1. The tool combines the web search and summarization processes into a single function call.\n",
    "2. It returns a formatted output containing summaries from multiple sources, each clearly attributed.\n",
    "3. The output is designed to provide a quick overview of the topic, with links to full sources for further reading.\n",
    "\n",
    "## Conclusion\n",
    "This notebook demonstrates the power of combining web search capabilities with AI-driven summarization. It offers a practical solution for rapid information gathering and synthesis, applicable in various domains such as research, journalism, business intelligence, and general knowledge acquisition. By automating the tedious aspects of web research, it allows users to focus on higher-level analysis and decision-making based on quickly acquired, relevant information.\n",
    "\n",
    "The modular design of this tool also allows for future enhancements, such as integration with different search engines, implementation of more advanced summarization techniques, or adaptation to specific domain knowledge requirements."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import Dependencies\n",
    "\n",
    "This cell imports all necessary libraries and sets up the environment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from langchain.tools import DuckDuckGoSearchResults\n",
    "from langchain_openai import ChatOpenAI\n",
    "from langchain import PromptTemplate\n",
    "from langchain_core.pydantic_v1 import BaseModel, Field\n",
    "from typing import List, Dict, Any, Tuple, Optional\n",
    "import re\n",
    "import nltk\n",
    "from dotenv import load_dotenv\n",
    "\n",
    "# Download necessary NLTK data\n",
    "nltk.download('punkt', quiet=True)\n",
    "nltk.download('stopwords', quiet=True)\n",
    "\n",
    "# Load environment variables\n",
    "load_dotenv()\n",
    "\n",
    "# Set OpenAI API key\n",
    "os.environ[\"OPENAI_API_KEY\"] = os.getenv('OPENAI_API_KEY')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Initialize DuckDuckGo Search\n",
    "\n",
    "This cell initializes the DuckDuckGo search tool."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "search = DuckDuckGoSearchResults()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define Data Models\n",
    "\n",
    "This cell defines the data model for text summarization."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "class SummarizeText(BaseModel):\n",
    "    \"\"\"Model for text to be summarized.\"\"\"\n",
    "    text: str = Field(..., title=\"Text to summarize\", description=\"The text to be summarized\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Helper Functions\n",
    "\n",
    "This section contains helper functions for parsing search results and performing web searches."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "def parse_search_results(results_string: str) -> List[dict]:\n",
    "    \"\"\"Parse a string representation of search results into a list of dictionaries.\"\"\"\n",
    "    results = []\n",
    "    entries = results_string.split(', snippet: ')\n",
    "    for entry in entries[1:]:  # Skip the first split as it's empty\n",
    "        parts = entry.split(', title: ')\n",
    "        if len(parts) == 2:\n",
    "            snippet = parts[0]\n",
    "            title_link = parts[1].split(', link: ')\n",
    "            if len(title_link) == 2:\n",
    "                title, link = title_link\n",
    "                results.append({\n",
    "                    'snippet': snippet,\n",
    "                    'title': title,\n",
    "                    'link': link\n",
    "                })\n",
    "    return results\n",
    "\n",
    "\n",
    "def perform_web_search(query: str, specific_site: Optional[str] = None) -> Tuple[List[str], List[Tuple[str, str]]]:\n",
    "    \"\"\"Perform a web search based on a query, optionally including a specific website.\"\"\"\n",
    "    try:\n",
    "        if specific_site:\n",
    "            specific_query = f\"site:{specific_site} {query}\"\n",
    "            print(f\"Searching for: {specific_query}\")\n",
    "            specific_results = search.run(specific_query)\n",
    "            print(f\"Specific search results: {specific_results}\")\n",
    "            specific_parsed = parse_search_results(specific_results)\n",
    "            \n",
    "            general_query = f\"-site:{specific_site} {query}\"\n",
    "            print(f\"Searching for: {general_query}\")\n",
    "            general_results = search.run(general_query)\n",
    "            print(f\"General search results: {general_results}\")\n",
    "            general_parsed = parse_search_results(general_results)\n",
    "            \n",
    "            combined_results = (specific_parsed + general_parsed)[:3]\n",
    "        else:\n",
    "            print(f\"Searching for: {query}\")\n",
    "            web_results = search.run(query)\n",
    "            print(f\"Web results: {web_results}\")\n",
    "            combined_results = parse_search_results(web_results)[:3]\n",
    "        \n",
    "        web_knowledge = [result.get('snippet', '') for result in combined_results]\n",
    "        sources = [(result.get('title', 'Untitled'), result.get('link', '')) for result in combined_results]\n",
    "        \n",
    "        print(f\"Processed web_knowledge: {web_knowledge}\")\n",
    "        print(f\"Processed sources: {sources}\")\n",
    "        return web_knowledge, sources\n",
    "    except Exception as e:\n",
    "        print(f\"Error in perform_web_search: {str(e)}\")\n",
    "        import traceback\n",
    "        traceback.print_exc()\n",
    "        return [], []"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Text Summarization Function\n",
    "\n",
    "This cell defines the function to summarize text using OpenAI's language model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "def summarize_text(text: str, source: Tuple[str, str]) -> str:\n",
    "    \"\"\"Summarize the given text using OpenAI's language model.\"\"\"\n",
    "    try:\n",
    "        llm = ChatOpenAI(temperature=0.7, model=\"gpt-4o-mini\")\n",
    "        prompt_template = \"Please summarize the following text in 1-2 bullet points:\\n\\n{text}\\n\\nSummary:\"\n",
    "        prompt = PromptTemplate(\n",
    "            template=prompt_template,\n",
    "            input_variables=[\"text\"],\n",
    "        )\n",
    "        summary_chain = prompt | llm\n",
    "        input_data = {\"text\": text}\n",
    "        summary = summary_chain.invoke(input_data)\n",
    "        \n",
    "        summary_content = summary.content if hasattr(summary, 'content') else str(summary)\n",
    "        \n",
    "        formatted_summary = f\"Source: {source[0]} ({source[1]})\\n{summary_content.strip()}\\n\"\n",
    "        return formatted_summary\n",
    "    except Exception as e:\n",
    "        print(f\"Error in summarize_text: {str(e)}\")\n",
    "        return \"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Main Search and Summarize Function\n",
    "\n",
    "This cell defines the main function that combines web search and text summarization."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "def search_summarize(query: str, specific_site: Optional[str] = None) -> str:\n",
    "    \"\"\"Perform a web search and summarize the results.\"\"\"\n",
    "    web_knowledge, sources = perform_web_search(query, specific_site)\n",
    "    \n",
    "    if not web_knowledge or not sources:\n",
    "        print(\"No web knowledge or sources found.\")\n",
    "        return \"\"\n",
    "    \n",
    "    summaries = [summarize_text(knowledge, source) for knowledge, source in zip(web_knowledge, sources) if summarize_text(knowledge, source)]\n",
    "    \n",
    "    combined_summary = \"\\n\".join(summaries)\n",
    "    return combined_summary"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example Usage\n",
    "\n",
    "This cell demonstrates how to use the search_summarize function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Searching for: site:https://www.nature.com What are the latest advancements in artificial intelligence?\n",
      "Specific search results: snippet: The latest advancements in generative artificial intelligence (AI) models have enabled the creation of realistic representations learned from vast amounts of data., title: Place identity: a generative AI's perspective, link: https://www.nature.com/articles/s41599-024-03645-7, snippet: Powered by a large language model (LLM) and trained on much of the text published on the Internet, the artificial intelligence (AI) chatbot, created by OpenAI in San Francisco, California, makes ..., title: Chatbots in science: What can ChatGPT do for you? - Nature, link: https://www.nature.com/articles/d41586-024-02630-z, snippet: Artificial-intelligence tools are transforming data-driven science — better ethical standards and more robust data curation are needed to fuel the boom and prevent a bust. NEWS FEATURE, title: Science and the new age of AI - Nature, link: https://www.nature.com/immersive/d41586-023-03017-2/index.html, snippet: Artificial intelligence (AI) systems, such as the chatbot ChatGPT, have become so advanced that they now very nearly match or exceed human performance in tasks including reading comprehension ..., title: AI now beats humans at basic tasks — new benchmarks are ... - Nature, link: https://www.nature.com/articles/d41586-024-01087-4\n",
      "Searching for: -site:https://www.nature.com What are the latest advancements in artificial intelligence?\n",
      "General search results: snippet: The web page does not mention any recent technological breakthrough in artificial intelligence, but it predicts four hot trends for 2024, such as customized chatbots, generative video, and AI ..., title: What's next for AI in 2024 | MIT Technology Review, link: https://www.technologyreview.com/2024/01/04/1086046/whats-next-for-ai-in-2024/, snippet: How generative AI tools like ChatGPT changed the internet and our daily interactions with it. Learn about the latest developments and challenges of AI in 2024, from chatbots to image-making models., title: AI for everything: 10 Breakthrough Technologies 2024, link: https://www.technologyreview.com/2024/01/08/1085096/artificial-intelligence-generative-ai-chatgpt-open-ai-breakthrough-technologies, snippet: As we approach 2024, experts foresee a blossoming interest in AI ethics education and a heightened prioritization of ethical considerations within AI research and development realms. 2. Augmented ..., title: The 5 Biggest Artificial Intelligence Trends For 2024 - Forbes, link: https://www.forbes.com/sites/bernardmarr/2023/11/01/the-top-5-artificial-intelligence-trends-for-2024/, snippet: Learn how generative AI will evolve in the next year, from more realistic expectations and multimodal models to smaller language models and open source advancements. Discover the challenges and opportunities for enterprises and users in the AI landscape., title: The Top Artificial Intelligence Trends | IBM, link: https://www.ibm.com/think/insights/artificial-intelligence-trends\n",
      "Processed web_knowledge: ['Powered by a large language model (LLM) and trained on much of the text published on the Internet, the artificial intelligence (AI) chatbot, created by OpenAI in San Francisco, California, makes ...', 'Artificial-intelligence tools are transforming data-driven science — better ethical standards and more robust data curation are needed to fuel the boom and prevent a bust. NEWS FEATURE', 'Artificial intelligence (AI) systems, such as the chatbot ChatGPT, have become so advanced that they now very nearly match or exceed human performance in tasks including reading comprehension ...']\n",
      "Processed sources: [('Chatbots in science: What can ChatGPT do for you? - Nature', 'https://www.nature.com/articles/d41586-024-02630-z'), ('Science and the new age of AI - Nature', 'https://www.nature.com/immersive/d41586-023-03017-2/index.html'), ('AI now beats humans at basic tasks — new benchmarks are ... - Nature', 'https://www.nature.com/articles/d41586-024-01087-4')]\n",
      "Summary of latest advancements in AI (including information from https://www.nature.com):\n",
      "Source: Chatbots in science: What can ChatGPT do for you? - Nature (https://www.nature.com/articles/d41586-024-02630-z)\n",
      "- OpenAI's AI chatbot, developed in San Francisco, utilizes a large language model trained on extensive internet text. \n",
      "- The chatbot is designed to generate human-like responses and assist users in various tasks.\n",
      "\n",
      "Source: Science and the new age of AI - Nature (https://www.nature.com/immersive/d41586-023-03017-2/index.html)\n",
      "- Artificial intelligence is revolutionizing data-driven science, but there is a need for improved ethical standards and enhanced data curation to support sustainable growth in the field.\n",
      "- Ensuring robust ethical practices and data management is essential to prevent potential setbacks in the advancement of AI in scientific research.\n",
      "\n",
      "Source: AI now beats humans at basic tasks — new benchmarks are ... - Nature (https://www.nature.com/articles/d41586-024-01087-4)\n",
      "- AI systems like ChatGPT are approaching or surpassing human performance in various tasks, including reading comprehension.\n",
      "- The advancements in AI technology highlight their growing capabilities and effectiveness in processing and understanding information.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "query = \"What are the latest advancements in artificial intelligence?\"\n",
    "specific_site = \"https://www.nature.com\"  # Optional: specify a site or set to None\n",
    "result = search_summarize(query, specific_site)\n",
    "print(f\"Summary of latest advancements in AI (including information from {specific_site if specific_site else 'various sources'}):\")\n",
    "print(result)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
